Add Model2Vec as an embedding backend #2245

Open · wants to merge 1 commit into master

MaartenGr (Owner)

What does this PR do?

Add Model2Vec as an incredibly fast but still quite accurate embedding backend.

Usage is straightforward. First, install model2vec:

pip install model2vec

Then, you can load any of their models and pass it to BERTopic like so:

from bertopic import BERTopic
from model2vec import StaticModel

# Load a pre-trained static embedding model and pass it to BERTopic
embedding_model = StaticModel.from_pretrained("minishlab/potion-base-8M")

topic_model = BERTopic(embedding_model=embedding_model)
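
For completeness, a minimal end-to-end run might look as follows; the 20 Newsgroups dataset is only a placeholder for your own documents:

from bertopic import BERTopic
from model2vec import StaticModel
from sklearn.datasets import fetch_20newsgroups

# Any list of strings works here; 20 Newsgroups is used purely as an example corpus
docs = fetch_20newsgroups(subset="all", remove=("headers", "footers", "quotes"))["data"]

# Embed the documents with the static Model2Vec model and fit the topic model
embedding_model = StaticModel.from_pretrained("minishlab/potion-base-8M")
topic_model = BERTopic(embedding_model=embedding_model)
topics, probs = topic_model.fit_transform(docs)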

Distillation

These models are extremely versatile and can be distilled from an existing embedding model (such as those compatible with sentence-transformers). The distillation process doesn't require a vocabulary (it uses the tokenizer's vocabulary by default) but can benefit from having one. Fortunately, this allows you to use the vocabulary from your input documents to distill a model yourself.

Doing so requires installing some additional model2vec dependencies:

pip install model2vec[distill]

To then distill common embedding models, you need to import the Model2VecBackend from BERTopic:

from bertopic import BERTopic
from bertopic.backend import Model2VecBackend

# Choose a model to distill (a non-Model2Vec model)
embedding_model = Model2VecBackend(
    "sentence-transformers/all-MiniLM-L6-v2",
    distill=True
)

topic_model = BERTopic(embedding_model=embedding_model)

You can also choose a custom vectorizer for creating the vocabulary and pass custom arguments to the distillation process:

from bertopic import BERTopic
from bertopic.backend import Model2VecBackend
from sklearn.feature_extraction.text import CountVectorizer

# Choose a model to distill (a non-Model2Vec model)
embedding_model = Model2VecBackend(
    "sentence-transformers/all-MiniLM-L6-v2",
    distill=True,
    distill_kwargs={"pca_dims": 256, "apply_zipf": True, "use_subword": True},
    distill_vectorizer=CountVectorizer(ngram_range=(1, 3))
)

topic_model = BERTopic(embedding_model=embedding_model)
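
Since the vocabulary can come from your input documents (see above), a rough end-to-end sketch might look as follows; the 20 Newsgroups subset is only a stand-in for your own corpus, and the distillation is assumed to happen when the documents are embedded during fitting:

from sklearn.datasets import fetch_20newsgroups

# Any list of strings works here; 20 Newsgroups is just an example corpus
docs = fetch_20newsgroups(subset="train", remove=("headers", "footers", "quotes"))["data"]

# Fitting embeds the documents with the (freshly distilled) static model
topics, probs = topic_model.fit_transform(docs)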

Before submitting

  • This PR fixes a typo or improves the docs (if yes, ignore all other checks!).
  • Did you read the contributor guideline?
  • Was this discussed/approved via a GitHub issue? Please add a link to it if that's the case.
  • Did you make sure to update the documentation with your changes (if applicable)?
  • Did you write any new necessary tests?
